The goal here is to clean the data and analyze how the variables in the dataset relate to life expectancy, using linear regressions and visualizations¶

What is the impact of time on life expectancy?

What is the impact of schooling on life expectancy?

How do infant and adult mortality rates affect life expectancy?

Do densely populated countries tend to have lower life expectancy?

What is the impact of GDP and income on life expectancy?

Does life expectancy have a positive or negative relationship with drinking alcohol?

Does life expectancy have a positive or negative correlation with eating habits, lifestyle, and alcohol?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
import statsmodels.formula.api as smf
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Viewing data¶

In [2]:
dataset = pd.read_csv('Life Expectancy Data.csv')
In [3]:
dataset.shape
Out[3]:
(2938, 22)
In [4]:
# viewing columns and their data types
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10   BMI                             2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio                            2919 non-null   float64
 13  Total expenditure                2712 non-null   float64
 14  Diphtheria                       2919 non-null   float64
 15   HIV/AIDS                        2938 non-null   float64
 16  GDP                              2490 non-null   float64
 17  Population                       2286 non-null   float64
 18   thinness  1-19 years            2904 non-null   float64
 19   thinness 5-9 years              2904 non-null   float64
 20  Income composition of resources  2771 non-null   float64
 21  Schooling                        2775 non-null   float64
dtypes: float64(16), int64(4), object(2)
memory usage: 505.1+ KB

Cleaning data¶

In [5]:
# removing the spaces to the left and right of the column names

dataset.columns = dataset.columns.str.strip()
In [6]:
# replacing remaining spaces in column names with underscores
# (a double space, as in 'thinness  1-19 years', becomes a double underscore)

dataset.columns = dataset.columns.str.replace(' ', '_')

# replacing dashes in column names with underscores
dataset.columns = dataset.columns.str.replace('-', '_')
In [7]:
# changing column headers to lowercase
dataset.columns = map(str.lower, dataset.columns)
In [8]:
# Checking the new format of the column names

dataset.columns.to_list()
Out[8]:
['country',
 'year',
 'status',
 'life_expectancy',
 'adult_mortality',
 'infant_deaths',
 'alcohol',
 'percentage_expenditure',
 'hepatitis_b',
 'measles',
 'bmi',
 'under_five_deaths',
 'polio',
 'total_expenditure',
 'diphtheria',
 'hiv/aids',
 'gdp',
 'population',
 'thinness__1_19_years',
 'thinness_5_9_years',
 'income_composition_of_resources',
 'schooling']
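
The cleaning steps above can also be chained into a single expression. The sketch below is an alternative, not what this notebook uses: the `raw` frame is a hypothetical stand-in for the CSV headers, and the regular expression is my own choice. It collapses every run of non-alphanumeric characters into one underscore, so it would produce `thinness_1_19_years` and `hiv_aids` rather than the `thinness__1_19_years` and `hiv/aids` names used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical empty frame whose headers imitate the raw CSV column names
raw = pd.DataFrame(columns=['Country', 'Life expectancy ', ' thinness  1-19 years', ' HIV/AIDS'])

cleaned = (
    raw.columns
    .str.strip()                                   # drop leading/trailing spaces
    .str.lower()                                   # lowercase everything
    .str.replace(r'[^0-9a-z]+', '_', regex=True)   # any run of other characters -> one underscore
)

print(cleaned.tolist())
```

If this variant were adopted, every later reference to the affected column names would need to change as well.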

Inspecting Null Values¶

In [9]:
# checking for null value counts
dataset.isna().sum()
Out[9]:
country                              0
year                                 0
status                               0
life_expectancy                     10
adult_mortality                     10
infant_deaths                        0
alcohol                            194
percentage_expenditure               0
hepatitis_b                        553
measles                              0
bmi                                 34
under_five_deaths                    0
polio                               19
total_expenditure                  226
diphtheria                          19
hiv/aids                             0
gdp                                448
population                         652
thinness__1_19_years                34
thinness_5_9_years                  34
income_composition_of_resources    167
schooling                          163
dtype: int64
In [10]:
# checking for null value percentages
dataset.isnull().sum()/len(dataset)*100
Out[10]:
country                             0.000000
year                                0.000000
status                              0.000000
life_expectancy                     0.340368
adult_mortality                     0.340368
infant_deaths                       0.000000
alcohol                             6.603131
percentage_expenditure              0.000000
hepatitis_b                        18.822328
measles                             0.000000
bmi                                 1.157250
under_five_deaths                   0.000000
polio                               0.646698
total_expenditure                   7.692308
diphtheria                          0.646698
hiv/aids                            0.000000
gdp                                15.248468
population                         22.191967
thinness__1_19_years                1.157250
thinness_5_9_years                  1.157250
income_composition_of_resources     5.684139
schooling                           5.547992
dtype: float64
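
The two checks above (counts and percentages) can be combined into one table that lists only the columns with missing values. This is a minimal sketch on a small made-up frame; the names `df` and `missing` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the real dataset
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, 1.0, 2.0],
    'c': [1, 2, 3, 4],
})

# isna().sum() counts nulls per column; isna().mean() gives the null fraction per column
missing = pd.DataFrame({
    'n_missing': df.isna().sum(),
    'pct_missing': df.isna().mean() * 100,
})

# keep only columns that have nulls, worst first
missing = missing[missing['n_missing'] > 0].sort_values('pct_missing', ascending=False)
print(missing)
```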
In [11]:
# visualizing the NA values

import missingno as msno
msno.bar(dataset)
Out[11]:
<AxesSubplot:>

Since I'm largely basing this analysis on life expectancy, I want to ensure that only countries with sufficient life expectancy data remain in the dataset.¶

Identifying countries that don't appear often enough. Every country should ideally have 16 occurrences, one for each of the 16 years in this dataset. The countries below appear only once in the entire dataset, representing a single year of data, so they will be removed¶

In [12]:
# counting rows per country, sorting ascending, and keeping the 10 countries with the fewest rows
dx = dataset.value_counts('country').sort_values(0).reset_index().head(20)
dx = dx.country[:10].tolist()
dx
Out[12]:
['Dominica',
 'San Marino',
 'Cook Islands',
 'Marshall Islands',
 'Tuvalu',
 'Saint Kitts and Nevis',
 'Palau',
 'Niue',
 'Nauru',
 'Monaco']
In [13]:
# removing countries from the dataset using the list in the previously created variable dx
# (.copy() makes the result an independent frame, so later column assignments are unambiguous)
dataset = dataset[~dataset.country.isin(dx)].copy()
In [14]:
# confirming that countries in the dx variable are gone

dataset.value_counts('country').sort_values(0).reset_index().head(5)

# ordered from least to greatest and checking the first 5 values
Out[14]:
country 0
0 Afghanistan 16
1 Botswana 16
2 Algeria 16
3 Angola 16
4 Antigua and Barbuda 16
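
Taking the first 10 entries of the sorted counts works here because exactly 10 countries have a single row. A more general approach, sketched below on made-up data, selects countries by a row-count threshold instead of by position. The names `toy`, `expected_years`, `incomplete`, and `filtered` are hypothetical.

```python
import pandas as pd

# Made-up data: countries A and C have 3 years each, B has only 1
toy = pd.DataFrame({
    'country': ['A'] * 3 + ['B'] + ['C'] * 3,
    'year': [2000, 2001, 2002, 2000, 2000, 2001, 2002],
})

expected_years = toy['year'].nunique()   # 3 here; 16 for the notebook's data
counts = toy['country'].value_counts()   # number of rows per country

# countries with fewer rows than there are years in the data
incomplete = counts[counts < expected_years].index.tolist()

filtered = toy[~toy['country'].isin(incomplete)].copy()
print(incomplete)
print(filtered['country'].unique().tolist())
```

With a threshold, the filter keeps working if the number of incomplete countries changes.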
In [15]:
# Filling NAs in each column with a forward fill, then a back fill, within each country
# (this copies the nearest available value for that country; it is not the country's average)

fill_cols = ['alcohol', 'hepatitis_b', 'bmi', 'polio', 'total_expenditure', 'diphtheria',
             'gdp', 'population', 'income_composition_of_resources', 'schooling',
             'thinness__1_19_years', 'thinness_5_9_years']

for col in fill_cols:
    dataset[col] = dataset.groupby("country")[col].transform(lambda x: x.ffill().bfill())
In [16]:
import missingno as msno
#checking null value counts again
msno.bar(dataset)
Out[16]:
<AxesSubplot:>

There are still null values remaining. This tells me that the forward fill and back fill had nothing to fill from, which means there are countries missing every value in a column¶

In [17]:
# testing my thought - checking for countries with all null values in the gdp column, which still has NAs

#grouping by country and counting the null values
gdp_null = dataset.gdp.isnull().groupby(dataset['country']).sum().reset_index()

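# locating countries with more than 1 null value for gdp; after the per-country fill above,
# any country that still has nulls is missing gdp for every year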
gdp_null = gdp_null.loc[gdp_null['gdp'] > 1]
gdp_null.count()
Out[17]:
country    25
gdp        25
dtype: int64

There are 25 countries with all null values in the gdp column. I will address these kinds of null values case by case, whenever I need to use the affected column.¶
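
The same check can be run for several columns at once. The sketch below uses a made-up frame and hypothetical names (`toy`, `value_cols`, `all_missing`). It applies the per-country forward and back fill, then flags each country and column pair where every value is still missing.

```python
import numpy as np
import pandas as pd

# Made-up data: B has no gdp at all, A has no population at all
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'gdp': [1.0, np.nan, np.nan, np.nan, 3.0, 4.0],
    'population': [np.nan, np.nan, 5.0, 6.0, np.nan, 7.0],
})
value_cols = ['gdp', 'population']

# forward fill then back fill within each country
for col in value_cols:
    toy[col] = toy.groupby('country')[col].transform(lambda s: s.ffill().bfill())

# True where a country has no usable value in that column
all_missing = toy[value_cols].isna().groupby(toy['country']).all()
print(all_missing)
print(all_missing.sum())   # number of fully-missing countries per column
```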

Checking for duplicated rows¶

In [18]:
dataset.duplicated().value_counts()

#there are no duplicated rows
Out[18]:
False    2928
dtype: int64

Viewing the distribution/correlations¶

In [19]:
# Viewing distribution of all data

dataset.hist(alpha = .5, bins = 60, figsize = (20,50), layout=(10,2))
plt.tight_layout()
plt.show()

There are some large differences between the maximum and minimum values in the distribution plots. I will let these values remain, since they may be legitimate: the dataset contains countries with both very large and very small populations¶

In [20]:
# counting the number of rows for developing and developed countries in the dataset
plt.figure(figsize=(12,8))
sns.countplot(data=dataset, x= 'status', order=dataset["status"].value_counts().index, palette= "husl")
Out[20]:
<AxesSubplot:xlabel='status', ylabel='count'>
In [21]:
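# assessing the differences between developed and developing countries
# numeric_only=True skips the text columns (needed on pandas 2.x, where they are no longer dropped silently)
dataset.groupby('status').mean(numeric_only=True).reset_index()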
Out[21]:
status year life_expectancy adult_mortality infant_deaths alcohol percentage_expenditure hepatitis_b measles bmi ... polio total_expenditure diphtheria hiv/aids gdp population thinness__1_19_years thinness_5_9_years income_composition_of_resources schooling
0 Developed 2007.5 79.197852 79.685547 1.494141 9.719336 2703.600380 83.302083 499.005859 51.803906 ... 93.736328 7.582148 93.476562 0.100000 22053.386446 6.830053e+06 1.320703 1.29668 0.852489 15.845474
1 Developing 2007.5 67.111465 182.833195 36.534768 3.422600 324.262018 74.617500 2836.618791 35.321351 ... 79.882450 5.577131 79.654801 2.096896 4217.160380 1.405700e+07 5.608725 5.65130 0.582092 11.225130

2 rows × 21 columns

In [22]:
# generating the numeric correlation table
# numeric_only=True (pandas 1.5+) skips the text columns country and status
dataset.corr(numeric_only=True)
Out[22]:
year life_expectancy adult_mortality infant_deaths alcohol percentage_expenditure hepatitis_b measles bmi under_five_deaths polio total_expenditure diphtheria hiv/aids gdp population thinness__1_19_years thinness_5_9_years income_composition_of_resources schooling
year 1.000000 0.170033 -0.079052 -0.036464 -0.077391 0.032723 0.243332 -0.081840 0.104668 -0.041980 0.103324 0.089489 0.142635 -0.138789 0.101017 0.016712 -0.045082 -0.048152 0.242953 0.213265
life_expectancy 0.170033 1.000000 -0.696359 -0.196557 0.402874 0.381864 0.319101 -0.157586 0.567694 -0.222529 0.460142 0.230900 0.474818 -0.556556 0.461511 -0.021371 -0.477183 -0.471584 0.724776 0.751975
adult_mortality -0.079052 -0.696359 1.000000 0.078756 -0.197702 -0.242860 -0.180219 0.031176 -0.387017 0.094146 -0.273500 -0.127032 -0.274592 0.523821 -0.297521 -0.013897 0.302904 0.308457 -0.457626 -0.454612
infant_deaths -0.036464 -0.196557 0.078756 1.000000 -0.114743 -0.085906 -0.219671 0.501038 -0.227480 0.996628 -0.166995 -0.128293 -0.171621 0.024955 -0.108046 0.556815 0.465700 0.471340 -0.145018 -0.195202
alcohol -0.077391 0.402874 -0.197702 -0.114743 1.000000 0.335397 0.092529 -0.050301 0.336358 -0.111825 0.226894 0.301697 0.224955 -0.047298 0.354459 -0.034409 -0.427377 -0.416304 0.446448 0.540703
percentage_expenditure 0.032723 0.381864 -0.242860 -0.085906 0.335397 1.000000 -0.001324 -0.056831 0.231130 -0.088152 0.148875 0.167788 0.145417 -0.098230 0.899650 -0.025576 -0.252397 -0.253931 0.382244 0.391466
hepatitis_b 0.243332 0.319101 -0.180219 -0.219671 0.092529 -0.001324 1.000000 -0.154028 0.216539 -0.230773 0.487581 0.114038 0.588850 -0.126628 0.068584 -0.087231 -0.165324 -0.177734 0.276544 0.298144
measles -0.081840 -0.157586 0.031176 0.501038 -0.050301 -0.056831 -0.154028 1.000000 -0.176069 0.507718 -0.132613 -0.103498 -0.138292 0.030673 -0.076057 0.265990 0.224579 0.220836 -0.129465 -0.138344
bmi 0.104668 0.567694 -0.387017 -0.227480 0.336358 0.231130 0.216539 -0.176069 1.000000 -0.237910 0.280603 0.239904 0.278867 -0.243735 0.303799 -0.071633 -0.530805 -0.537784 0.509299 0.558363
under_five_deaths -0.041980 -0.222529 0.094146 0.996628 -0.111825 -0.088152 -0.230773 0.507718 -0.237910 1.000000 -0.184896 -0.130034 -0.191998 0.037783 -0.111872 0.544437 0.467771 0.472244 -0.163185 -0.210945
polio 0.103324 0.460142 -0.273500 -0.166995 0.226894 0.148875 0.487581 -0.132613 0.280603 -0.184896 1.000000 0.148531 0.679621 -0.156445 0.217112 -0.036116 -0.218594 -0.219433 0.389327 0.424306
total_expenditure 0.089489 0.230900 -0.127032 -0.128293 0.301697 0.167788 0.114038 -0.103498 0.239904 -0.130034 0.148531 1.000000 0.157033 -0.005506 0.139612 -0.077157 -0.277092 -0.284654 0.183496 0.277368
diphtheria 0.142635 0.474818 -0.274592 -0.171621 0.224955 0.145417 0.588850 -0.138292 0.278867 -0.191998 0.679621 0.157033 1.000000 -0.162214 0.206914 -0.026114 -0.225675 -0.219046 0.410487 0.433396
hiv/aids -0.138789 -0.556556 0.523821 0.024955 -0.047298 -0.098230 -0.126628 0.030673 -0.243735 0.037783 -0.156445 -0.005506 -0.162214 1.000000 -0.135058 -0.027818 0.203550 0.206772 -0.249380 -0.222214
gdp 0.101017 0.461511 -0.297521 -0.108046 0.354459 0.899650 0.068584 -0.076057 0.303799 -0.111872 0.217112 0.139612 0.206914 -0.135058 1.000000 -0.027784 -0.288711 -0.293229 0.457725 0.445368
population 0.016712 -0.021371 -0.013897 0.556815 -0.034409 -0.025576 -0.087231 0.265990 -0.071633 0.544437 -0.036116 -0.077157 -0.026114 -0.027818 -0.027784 1.000000 0.253449 0.250954 -0.008319 -0.031193
thinness__1_19_years -0.045082 -0.477183 0.302904 0.465700 -0.427377 -0.252397 -0.165324 0.224579 -0.530805 0.467771 -0.218594 -0.277092 -0.225675 0.203550 -0.288711 0.253449 1.000000 0.938953 -0.422210 -0.477434
thinness_5_9_years -0.048152 -0.471584 0.308457 0.471340 -0.416304 -0.253931 -0.177734 0.220836 -0.537784 0.472244 -0.219433 -0.284654 -0.219046 0.206772 -0.293229 0.250954 0.938953 1.000000 -0.410825 -0.466334
income_composition_of_resources 0.242953 0.724776 -0.457626 -0.145018 0.446448 0.382244 0.276544 -0.129465 0.509299 -0.163185 0.389327 0.183496 0.410487 -0.249380 0.457725 -0.008319 -0.422210 -0.410825 1.000000 0.800046
schooling 0.213265 0.751975 -0.454612 -0.195202 0.540703 0.391466 0.298144 -0.138344 0.558363 -0.210945 0.424306 0.277368 0.433396 -0.222214 0.445368 -0.031193 -0.477434 -0.466334 0.800046 1.000000
In [23]:
# plotting the absolute correlation (in %) of each column with life_expectancy as a bar chart
pd.DataFrame(abs(dataset.corr(numeric_only=True)['life_expectancy'].\
                 drop('life_expectancy')*100).sort_values(ascending=False)).plot.bar(figsize = (12,8))
plt.yticks(size = 15)
plt.xticks(size = 12)
plt.show()

There are several columns that correlate with life expectancy in this dataset, according to the plot above¶
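
The bar chart above plots absolute values, so it hides whether each correlation is positive or negative. The sketch below ranks a few columns by strength while keeping the sign. The values are copied from the life_expectancy row of the correlation table above; the names `corr`, `ranked`, and `summary` are hypothetical.

```python
import pandas as pd

# Correlations with life_expectancy, copied from the correlation table above
corr = pd.Series({
    'population': -0.021371,
    'bmi': 0.567694,
    'adult_mortality': -0.696359,
    'schooling': 0.751975,
    'hiv/aids': -0.556556,
    'income_composition_of_resources': 0.724776,
})

# order by absolute strength, but keep the signed values
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)

summary = pd.DataFrame({
    'r': ranked,
    'direction': ranked.map(lambda r: 'positive' if r > 0 else 'negative'),
})
print(summary)
```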

In [24]:
# generating the correlation matrix as an annotated heatmap
plt.figure(figsize=(20,12))
sns.heatmap(dataset.corr(numeric_only=True),annot=True)
Out[24]:
<AxesSubplot:>

Plotting such a large number of correlations at once is not always useful, but it is worth a look to see that many of the variables are correlated here¶

In [25]:
sns.pairplot(dataset)
Out[25]:
<seaborn.axisgrid.PairGrid at 0x7ff228dae6d0>

What is the effect of time (in years) on life expectancy?¶

The code below will generate an interactive plot. Click one or more country names in the legend to display their data. Double-click the legend to display all data or remove all data¶

In [26]:
%matplotlib widget

# collecting the unique country names into a list
country = np.unique(dataset.country.tolist()).tolist()

# extract color palette into a list, the palette can be changed
pal = list(sns.color_palette(palette=('Set2'), n_colors=len(country)).as_hex())

fig = go.Figure()

# looping through each country in the dataset, and plotting each data point for each year
for d,p in zip(country, pal):
    fig.add_trace(go.Scatter(x = dataset[dataset['country']==d]['year'],
                             y = dataset[dataset['country']==d]['life_expectancy'],
                             #visible = 'legendonly',
                             name = d,
                             line_color = p, 
                             fill=None))  #tozeroy 

# The code below will generate an interactive plot. Click one or more country names in the legend
# to display their data. Double-click the legend to display all data or remove all data.
fig.show()

Generally, life expectancy has risen over time (in years) across most countries.¶

In [27]:
# modeling life_expectancy and year below via linear regression

year_model = smf.ols(formula = 'life_expectancy ~ year',
               data = dataset).fit()
print(year_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        life_expectancy   R-squared:                       0.029
Model:                            OLS   Adj. R-squared:                  0.029
Method:                 Least Squares   F-statistic:                     87.11
Date:                Thu, 29 Dec 2022   Prob (F-statistic):           1.96e-20
Time:                        22:29:51   Log-Likelihood:                -10710.
No. Observations:                2928   AIC:                         2.142e+04
Df Residuals:                    2926   BIC:                         2.144e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   -635.8715     75.546     -8.417      0.000    -783.999    -487.744
year           0.3512      0.038      9.333      0.000       0.277       0.425
==============================================================================
Omnibus:                      178.257   Durbin-Watson:                   0.151
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              190.835
Skew:                          -0.593   Prob(JB):                     3.64e-42
Kurtosis:                       2.606   Cond. No.                     8.74e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.74e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

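P-value of the above regression:¶

In [29]:
# note: pvalues[0] is the p-value of the Intercept term; the p-value of the slope is
# year_model.pvalues['year'], which in a one-predictor model equals Prob (F-statistic) above
year_model.pvalues[0]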
Out[29]:
5.957756451252083e-17

The p-value for the year coefficient is incredibly low (Prob (F-statistic) = 1.96e-20), meaning there is a very slim chance that the relationship between life expectancy and year is a coincidence. On average, life expectancy rises by about 0.35 years for each calendar year. However, only 2.9% of the variance in life expectancy is explained by year, according to the R-squared value, so year alone is a weak predictor. One possible explanation is that people keep adapting to their environments and finding more ways to protect themselves from environmental harms, at both the individual and the national level, and so live longer over time¶
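
To make the size of the year effect concrete, the fitted line can be evaluated by hand. The sketch below copies the coefficients from the OLS summary above (rounded as printed there) and assumes the data spans 2000 to 2015, which matches 16 years with a mean year of 2007.5. The function name `fitted_life_expectancy` is hypothetical.

```python
# Coefficients copied from the OLS summary above (rounded as printed there)
intercept = -635.8715
slope = 0.3512  # years of life expectancy gained per calendar year

def fitted_life_expectancy(year):
    """Fitted value of the one-predictor model life_expectancy ~ year."""
    return intercept + slope * year

# Assumed bounds of the data: 16 years with a mean year of 2007.5
first_year, last_year = 2000, 2015

gain = fitted_life_expectancy(last_year) - fitted_life_expectancy(first_year)

print(round(fitted_life_expectancy(first_year), 1))
print(round(fitted_life_expectancy(last_year), 1))
print(round(gain, 1))
```

Under these assumptions the fitted line rises by roughly 5 years over the period, a modest average trend next to the spread between countries. That is consistent with the low R-squared.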

What is the effect of schooling on life expectancy?¶

The code below will generate an interactive plot. Click one or more country names in the legend to display their data. Double-click the legend to display all data or remove all data¶

In [30]:
px.scatter(dataset, x='life_expectancy',y='schooling',color='country',
           size = 'year',title='Life Expectancy and Schooling')

This visualization suggests that more schooling is strongly associated with a higher life expectancy. Increased education is linked to a higher standard of living, greater health consciousness, more income and access to health care, and much more.¶

In [31]:
# modeling life_expectancy and schooling below via linear regression

schooling_model = smf.ols(formula = 'life_expectancy ~ schooling',
               data = dataset).fit()
print(schooling_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        life_expectancy   R-squared:                       0.565
Model:                            OLS   Adj. R-squared:                  0.565
Method:                 Least Squares   F-statistic:                     3599.
Date:                Thu, 29 Dec 2022   Prob (F-statistic):               0.00
Time:                        22:30:17   Log-Likelihood:                -8964.3
No. Observations:                2768   AIC:                         1.793e+04
Df Residuals:                    2766   BIC:                         1.794e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     44.1089      0.437    100.992      0.000      43.252      44.965
schooling      2.1035      0.035     59.995      0.000       2.035       2.172
==============================================================================
Omnibus:                      283.391   Durbin-Watson:                   0.267
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1122.013
Skew:                          -0.445   Prob(JB):                    2.28e-244
Kurtosis:                       5.989   Cond. No.                         46.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

About 56.5% of the variance in life expectancy is explained by schooling in this model. In the correlation table above, schooling also has a positive correlation of about 0.45 with gdp and about 0.80 with income composition of resources. Schooling, in general, is linked to better outcomes¶

Predict schooling based on life expectancy¶

In [32]:
#creating new dataset without the NAs in schooling.

prediction_data = dataset[['life_expectancy', 'schooling']].dropna()

# Test train split for supervised learning
X_train, X_test, y_train, y_test = train_test_split(prediction_data.life_expectancy, prediction_data.schooling)
In [33]:
# Test Train split visualization
plt.figure(figsize=(9,6))
plt.scatter(X_train, y_train, label = 'Training Data',color ='r', alpha =.7)
plt.scatter(X_test, y_test, label = 'Testing Data', color ='g', alpha =.7)
plt.legend()
plt.title('Life expectancy and Schooling')
plt.xlabel('Life Expectancy')
plt.ylabel('Schooling')
plt.show()
Figure
In [34]:
# create linear model and train it
LR = LinearRegression()
LR.fit(X_train.values.reshape(-1,1), y_train.values)
Out[34]:
LinearRegression()
In [35]:
# Use model to predict on test data
prediction = LR.predict(X_test.values.reshape(-1,1))

#plot prediction line against actual test data
plt.figure(figsize=(9,6))
plt.plot(X_test, prediction, label = 'Linear Regression', color = 'b')
plt.scatter(X_test, y_test, label = 'Actual Test Data',color= 'g',alpha= .7)
plt.title('Life expectancy and Schooling')
plt.xlabel('Average Life Expectancy by Country')
plt.ylabel('Schooling')
plt.legend()
plt.show()
Figure

Predict the schooling level of a country based on a life expectancy of 55¶

In [36]:
# putting a life expectancy of 55 into the prediction. The output is the schooling level (in years of schooling)

print('A country with a life expectancy of 55 is predicted to experience about',
      round(LR.predict([[55]])[0],2), 'years of schooling')
A country with a life expectancy of 55 is predicted to experience about 8.17 years of schooling
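
The split above is random, so the fitted line and the prediction can shift slightly between runs, and the model is never scored on the test data. The sketch below shows a reproducible split and a held-out R-squared. It deliberately swaps in NumPy (`np.polyfit`) for scikit-learn to stay dependency-light, and it uses synthetic data: the arrays `life_exp` and `schooling_years` and their generating numbers are made up for illustration. In the notebook itself, passing `random_state` to `train_test_split` and calling `LR.score` on the test data would be the closer equivalent.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the split and the noise are reproducible

# Synthetic stand-in data (made up): schooling grows roughly linearly with life expectancy
n = 200
life_exp = rng.uniform(45, 85, size=n)
schooling_years = 0.27 * life_exp - 6.5 + rng.normal(0, 1.5, size=n)

# Shuffle the row indices once, then hold out the last 25% as the test set
idx = rng.permutation(n)
train, test = idx[:150], idx[150:]

# Fit a straight line on the training rows only
slope, intercept = np.polyfit(life_exp[train], schooling_years[train], deg=1)

# Score on the held-out rows: R^2 = 1 - SS_res / SS_tot
pred = slope * life_exp[test] + intercept
ss_res = np.sum((schooling_years[test] - pred) ** 2)
ss_tot = np.sum((schooling_years[test] - schooling_years[test].mean()) ** 2)
r2_test = 1 - ss_res / ss_tot

print(round(slope, 3), round(r2_test, 3))
```

A held-out score close to the in-sample R-squared suggests that the simple line generalizes about as well as it fits.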

What is the effect of adult mortality on life expectancy?¶

The code below will generate an interactive plot. Click one or more country names in the legend to display their data. Double-click the legend to display all data or remove all data¶

In [37]:
px.scatter(dataset, x='life_expectancy',y='adult_mortality',color='country',
           size = 'year',title='Life Expectancy and Adult Mortality')

The plot above suggests that the lower the adult mortality, the higher the life expectancy¶

In [38]:
# modeling life_expectancy and adult_mortality below via linear regression

adult_model = smf.ols(formula = 'life_expectancy ~ adult_mortality',
               data = dataset).fit()
print(adult_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        life_expectancy   R-squared:                       0.485
Model:                            OLS   Adj. R-squared:                  0.485
Method:                 Least Squares   F-statistic:                     2755.
Date:                Thu, 29 Dec 2022   Prob (F-statistic):               0.00
Time:                        22:30:19   Log-Likelihood:                -9782.0
No. Observations:                2928   AIC:                         1.957e+04
Df Residuals:                    2926   BIC:                         1.958e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          78.0182      0.210    371.804      0.000      77.607      78.430
adult_mortality    -0.0534      0.001    -52.485      0.000      -0.055      -0.051
==============================================================================
Omnibus:                     1021.341   Durbin-Watson:                   0.762
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3874.676
Skew:                          -1.703   Prob(JB):                         0.00
Kurtosis:                       7.490   Cond. No.                         343.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

About 48.5% of the variance in life expectancy is explained by adult mortality in this model, according to the R-squared value. With a very low p-value, the result is statistically significant. The relationship is clear: the lower the adult mortality, the higher the life expectancy. This ties into the previously noted relationship between life expectancy and year: life expectancy tends to rise as the years go on, but high adult mortality holds it back¶


What is the effect of infant deaths on life expectancy?¶

The code below will generate an interactive plot. Click one or more country names in the legend to display their data. Double-click the legend to display all data or remove all data¶

In [39]:
px.scatter(dataset, x='life_expectancy',y='infant_deaths',color='country',
           size = 'year',title='Life Expectancy and Infant Deaths')
In [40]:
# modeling life_expectancy and infant_deaths below via linear regression

infant_deaths_model = smf.ols(formula = 'life_expectancy ~ infant_deaths',
               data = dataset).fit()
print(infant_deaths_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        life_expectancy   R-squared:                       0.039
Model:                            OLS   Adj. R-squared:                  0.038
Method:                 Least Squares   F-statistic:                     117.6
Date:                Thu, 29 Dec 2022   Prob (F-statistic):           6.88e-27
Time:                        22:30:21   Log-Likelihood:                -10696.
No. Observations:                2928   AIC:                         2.140e+04
Df Residuals:                    2926   BIC:                         2.141e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        69.7069      0.178    391.102      0.000      69.357      70.056
infant_deaths    -0.0158      0.001    -10.844      0.000      -0.019      -0.013
==============================================================================
Omnibus:                      166.073   Durbin-Watson:                   0.162
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              193.261
Skew:                          -0.624   Prob(JB):                     1.08e-42
Kurtosis:                       2.837   Cond. No.                         126.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

With an R-squared value of only 0.039, infant deaths explain only 3.9% of the variance in life expectancy. Infants are children under 1 year old, a small share of any nation's population, so there is much more behind the world's life expectancy figures than infant deaths alone. However, with a very low p-value, the result is statistically significant.¶


What is the effect of population on life expectancy?¶

In [41]:
# modeling life_expectancy and population below via linear regression

pop_model = smf.ols(formula = 'life_expectancy ~ population',
               data = dataset).fit()
print(pop_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        life_expectancy   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.045
Date:                Thu, 29 Dec 2022   Prob (F-statistic):              0.307
Time:                        22:30:22   Log-Likelihood:                -8472.5
No. Observations:                2288   AIC:                         1.695e+04
Df Residuals:                    2286   BIC:                         1.696e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     68.7216      0.210    327.628      0.000      68.310      69.133
population -3.442e-09   3.37e-09     -1.022      0.307      -1e-08    3.16e-09
==============================================================================
Omnibus:                      115.164   Durbin-Watson:                   0.171
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              120.077
Skew:                          -0.529   Prob(JB):                     8.43e-27
Kurtosis:                       2.625   Cond. No.                     6.36e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.36e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

With an R-squared value of 0.000, population explains essentially none of the variance in life expectancy in this linear model, and the p-value of 0.307 means the slope is not statistically significant. This outcome is consistent with the earlier correlation plots, where life expectancy has a correlation coefficient of -0.04 with population. A larger or smaller population is not associated with a higher or lower life expectancy.
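
The significance columns of the summary can be checked by hand. The sketch below, using only the rounded numbers printed in the population row above, recomputes the t-statistic and tests whether the 95% confidence interval contains zero. Because the inputs are rounded, the recomputed t-statistic can differ slightly from the reported -1.022.

```python
# Values copied from the population row of the OLS summary above
coef = -3.442e-09
std_err = 3.37e-09
ci_low, ci_high = -1e-08, 3.16e-09

# The t-statistic is the coefficient divided by its standard error
t_stat = coef / std_err

# A 95% interval that contains zero means the slope is not significant at the 5% level
ci_contains_zero = ci_low <= 0.0 <= ci_high

print(f"t = {t_stat:.3f}, CI contains zero: {ci_contains_zero}")
```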

In [ ]:
 

What is the effect of income on life expectancy?

The code below generates an interactive plot. Click one or more country names in the legend to toggle their data. Double-click a legend entry to isolate that country, or double-click again to show all data.

In [42]:
px.scatter(dataset, x='life_expectancy',y='income_composition_of_resources',color='country',
           size = 'year',title='Life Expectancy and Income')

According to the plot above, income composition of resources has a strong positive correlation with life expectancy.

In [43]:
# modeling life_expectancy and income below via linear regression

income_model = smf.ols(formula = 'life_expectancy ~ income_composition_of_resources',
               data = dataset).fit()
print(income_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        life_expectancy   R-squared:                       0.525
Model:                            OLS   Adj. R-squared:                  0.525
Method:                 Least Squares   F-statistic:                     3061.
Date:                Thu, 29 Dec 2022   Prob (F-statistic):               0.00
Time:                        22:30:23   Log-Likelihood:                -9086.7
No. Observations:                2768   AIC:                         1.818e+04
Df Residuals:                    2766   BIC:                         1.819e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
Intercept                          49.1735      0.385    127.809      0.000      48.419      49.928
income_composition_of_resources    32.1572      0.581     55.325      0.000      31.018      33.297
==============================================================================
Omnibus:                      303.292   Durbin-Watson:                   0.338
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1554.927
Skew:                           0.395   Prob(JB):                         0.00
Kurtosis:                       6.586   Cond. No.                         6.67
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

About 52% of the variance in life expectancy is explained by income composition of resources in this model. More income provides access to a better standard of living, schooling, a larger network, and health care, and brings less financial stress.
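
In a regression with a single predictor, R-squared is the square of the Pearson correlation coefficient, so the summary above also implies a correlation. A minimal sketch, using only the printed R-squared and slope:

```python
import math

r_squared = 0.525   # R-squared from the summary above
slope = 32.1572     # coefficient of income_composition_of_resources

# In a one-predictor OLS model, |r| = sqrt(R-squared) and r takes the sign of the slope
r = math.copysign(math.sqrt(r_squared), slope)
print(round(r, 2))
```

This value applies to the 2768 observations the model used, so it may differ slightly from a correlation computed on a different subset of rows.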

In [ ]:
 

What is the effect of GDP on life expectancy?

The code below generates an interactive plot. Click one or more country names in the legend to toggle their data. Double-click a legend entry to isolate that country, or double-click again to show all data.

In [44]:
px.scatter(dataset, x='life_expectancy',y='gdp',color='country',
           size = 'year',title='Life Expectancy and GDP')

As visualized above, countries with higher life expectancy tend to have higher GDP.

In [45]:
# modeling life_expectancy and gdp below via linear regression

gdp_model = smf.ols(formula = 'life_expectancy ~ gdp',
               data = dataset).fit()
print(gdp_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        life_expectancy   R-squared:                       0.213
Model:                            OLS   Adj. R-squared:                  0.213
Method:                 Least Squares   F-statistic:                     683.6
Date:                Thu, 29 Dec 2022   Prob (F-statistic):          1.44e-133
Time:                        22:30:25   Log-Likelihood:                -9026.2
No. Observations:                2528   AIC:                         1.806e+04
Df Residuals:                    2526   BIC:                         1.807e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     66.8970      0.193    346.913      0.000      66.519      67.275
gdp            0.0003   1.21e-05     26.146      0.000       0.000       0.000
==============================================================================
Omnibus:                      156.820   Durbin-Watson:                   0.378
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              186.682
Skew:                          -0.666   Prob(JB):                     2.90e-41
Kurtosis:                       2.986   Cond. No.                     1.80e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.8e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

About 21% of the variance in life expectancy is explained by GDP in this model. A country with a higher GDP tends to be a country with higher income; thus, its citizens have a better standard of living, better food, and access to health care and schooling. Many other factors are at play alongside GDP.
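
The summary prints the gdp coefficient as 0.0003 with a confidence interval of 0.000 to 0.000 only because of display rounding. As a rough sketch, the coefficient can be recovered from the printed t-statistic and standard error (both rounded themselves, so the result is approximate) and rescaled to a more readable unit:

```python
# Values copied from the gdp row of the OLS summary above
t_stat = 26.146
std_err = 1.21e-05

# coef = t * std_err, which recovers digits hidden by the 4-decimal display
coef = t_stat * std_err

# Express the slope per 1,000 units of gdp to make it easier to read
years_per_1000_gdp = coef * 1000
print(f"coef = {coef:.6f}, years per 1,000 gdp units = {years_per_1000_gdp:.2f}")
```

Rescaling the predictor changes only the size of the coefficient; the R-squared, t-statistic, and p-value of the model stay the same.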

In [ ]:
 

What is the effect of alcohol on life expectancy?

The code below generates an interactive plot. Click one or more country names in the legend to toggle their data. Double-click a legend entry to isolate that country, or double-click again to show all data.

In [46]:
px.scatter(dataset, x='life_expectancy',y='alcohol',color='country',
           size = 'year',title='Life Expectancy and Alcohol')

There appears to be a positive correlation between life expectancy and alcohol consumption, with a correlation coefficient of 0.39 between the two.

In [47]:
# modeling life_expectancy and alcohol below via linear regression

alcohol_model = smf.ols(formula = 'life_expectancy ~ alcohol',
               data = dataset).fit()
print(alcohol_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        life_expectancy   R-squared:                       0.162
Model:                            OLS   Adj. R-squared:                  0.162
Method:                 Least Squares   F-statistic:                     563.8
Date:                Thu, 29 Dec 2022   Prob (F-statistic):          4.49e-114
Time:                        22:30:27   Log-Likelihood:                -10423.
No. Observations:                2912   AIC:                         2.085e+04
Df Residuals:                    2910   BIC:                         2.086e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     65.0583      0.241    270.352      0.000      64.587      65.530
alcohol        0.9385      0.040     23.745      0.000       0.861       1.016
==============================================================================
Omnibus:                      270.852   Durbin-Watson:                   0.199
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              348.774
Skew:                          -0.831   Prob(JB):                     1.84e-76
Kurtosis:                       3.335   Cond. No.                         9.25
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

About 16% of the variance in life expectancy is explained by alcohol consumption in this model. Could it be that being able to afford alcohol correlates with income? The answer is yes: a country's alcohol consumption and income are positively correlated, with a correlation coefficient of 0.42. Those who can afford the luxury of alcohol are apparently more likely to have a higher income and a higher standard of living.
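
The idea that income drives both alcohol consumption and life expectancy can be illustrated with a small simulation. The sketch below uses made-up synthetic data (not the notebook's dataset) in which life expectancy depends on income only, and a hypothetical helper `ols_coefs` built on `np.linalg.lstsq`. It shows how a predictor can look influential on its own and lose its effect once the confounder is included.

```python
import numpy as np

# Synthetic data only: these numbers are made up and are NOT the notebook's dataset
rng = np.random.default_rng(0)
n = 2000
income = rng.uniform(0.3, 0.9, n)
alcohol = 10 * income + rng.normal(0, 2, n)    # alcohol depends on income
life = 49 + 32 * income + rng.normal(0, 5, n)  # life expectancy depends on income only

def ols_coefs(y, *predictors):
    """Least-squares coefficients [intercept, slope_1, ...] (hypothetical helper)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

alcohol_only = ols_coefs(life, alcohol)[1]              # slope with alcohol as sole predictor
alcohol_adjusted = ols_coefs(life, alcohol, income)[1]  # slope once income is also included

print(f"alcohol slope alone:       {alcohol_only:.2f}")
print(f"alcohol slope with income: {alcohol_adjusted:.2f}")
```

On the notebook's data, the analogous check would be to pass a formula such as `'life_expectancy ~ alcohol + income_composition_of_resources'` to `smf.ols`. Whether the alcohol coefficient shrinks there is not shown in this notebook.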

In [ ]:
 

Overall, increases in GDP, schooling, income, and time (in years), together with lower adult mortality, are some of the pillars associated with higher life expectancy. These pillars, in short, allow people to live longer (and maybe better) lives, with access to health care, knowledge of health care, and even access to more alcohol!